Random Forest Analysis of Heart Failure Dataset

Tim Leschke, Pamela Mishaw, Sierra Landacre, Pallak Singh

Random Forest

  • Random Forest is a promising machine learning model because it classifies large datasets accurately, is robust to outliers, and is easy to use.
  • Random Forest is a group of decision trees (a forest) that are created from identically distributed, independent random samples of data drawn with replacement from the original dataset (Breiman 2001).
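The bootstrap sampling described above can be sketched in a few lines of pure Python (an illustrative stand-in, not the report's R code): each tree in the forest is grown on a sample of the same size as the original dataset, drawn with replacement, so some observations appear more than once and others are left out.

```python
import random

def bootstrap_sample(data, rng):
    """Draw a bootstrap sample: n observations sampled with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

rng = random.Random(0)
data = list(range(10))
sample = bootstrap_sample(data, rng)
print(len(sample))                      # same size as the original dataset
print(sorted(set(data) - set(sample)))  # observations left "out of bag"
```

The observations left out of a given tree's sample are its out-of-bag (OOB) data, which Random Forest uses to estimate test error without a separate validation set.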

Methods

  • A standard decision tree model uses a single classification tree.

  • Random Forest uses multiple classification trees.

Methods

The Gini Index is a measure of node purity (James et al. 2021). It can also be used to measure the importance of each predictor. The Gini Index is defined by the following formula, where K is the number of classes and \({\hat{p}_{mk}}\) is the proportion of observations in the mth region that are from the kth class. A Gini Index of 0 represents perfect purity.

\[G=\sum_{k=1}^{K} \hat{p}_{mk}\left(1-\hat{p}_{mk}\right)\]
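The formula above translates directly into code. A minimal Python sketch (not the report's R implementation) computes the index from a node's class proportions:

```python
def gini_index(proportions):
    """Gini index: sum over classes of p_mk * (1 - p_mk) for one node."""
    return sum(p * (1 - p) for p in proportions)

print(gini_index([1.0, 0.0]))   # 0.0 -> perfectly pure node
print(gini_index([0.5, 0.5]))   # 0.5 -> maximally impure two-class node
```

The index is 0 when a node contains a single class and grows as the classes mix, which is why splits that lower the Gini Index are preferred.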

Bagging is the aggregation of the results from each decision tree. It is defined by the following formula where B is the number of training sets and \(\hat{f}^{*b}\) is the prediction model. Although bagging improves prediction accuracy, it makes interpreting the results harder as they cannot be visualized as easily as a single decision tree (James et al. 2021).

\[\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat{f}^{*b}(x)\]
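The averaging in the bagging formula can be sketched in pure Python (a toy stand-in for the fitted trees, not the report's R code; for classification, Random Forest takes a majority vote rather than an average):

```python
def bag_predict(models, x):
    """Average the predictions of B models fit on B bootstrap samples."""
    return sum(f(x) for f in models) / len(models)

# Three toy "models" standing in for trees fit on different bootstrap samples.
models = [lambda x: x + 1, lambda x: x - 1, lambda x: x]
print(bag_predict(models, 10.0))  # -> 10.0
```

Averaging many high-variance trees lowers the variance of the combined prediction, which is the source of bagging's accuracy gain.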

Analysis and Results

  • We ingest the data into RStudio
  • Perform classification with Random Forest
  • Perform analysis

Data Used

  • The dataset comes from the Faisalabad Institute of Cardiology and the Allied Hospital in Faisalabad, Pakistan.
  • There are 299 patient records with 13 features per record.
  • A Random Forest model is used to predict heart failure events for patients.

Model Evaluation Metrics

Various metrics are used to assess the Random Forest model performance:

  • Out-of-bag (OOB) error rate/accuracy - OOB data are the observations left out of the bootstrap sample used to grow a given tree; the error on these held-out observations estimates test error.
  • Confusion Matrix - shows True Positive, False Positive, False Negative and True Negative values to support performance evaluation
  • Precision - also known as positive predictive value; the proportion of observations labeled positive that are actually positive.
  • Recall - also known as sensitivity; the proportion of actual positives that are predicted as positive.
  • F1 - harmonic mean of the precision and recall; assesses predictive performance.
  • Balanced accuracy - the average of the true positive rate (sensitivity) and the true negative rate (specificity).
  • AUC-ROC - area under the ROC curve, which plots the true positive rate against the false positive rate across classification thresholds.
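The metrics above follow directly from the four confusion matrix counts. A pure-Python sketch (illustrative values only, not results from the heart failure models):

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard metrics from confusion matrix counts."""
    precision = tp / (tp + fp)            # positive predictive value
    recall = tp / (tp + fn)               # sensitivity / true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    balanced_accuracy = (recall + specificity) / 2
    return precision, recall, f1, balanced_accuracy

# Hypothetical counts, chosen only to illustrate the calculation.
p, r, f1, ba = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(round(p, 3), round(r, 3), round(f1, 3), round(ba, 3))
```

Because the dataset's classes are not evenly split, balanced accuracy and F1 give a fairer picture of performance than raw accuracy alone.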

Variable Importance Plot

FourFold Plot (Confusion Matrix)

Confusion Matrix Heatmap

Variable Correlation Heatmap

ROC - Default Values

trainControl() - Random Selection

Variable Importance Plot - Tuned

FourFold Plot (Confusion Matrix)

Kappa Plot

Tuned vs. Default ROC

Model Results

  • The testing accuracies of both models are slightly lower than their respective training accuracies, which indicates some overfitting of the training data.
  • Model 2 outperforms model 1: its training and testing accuracies, precision, F1, and balanced accuracy are all higher than those of the default model.

Conclusion

Software and Hardware Configuration

  • RStudio Pro 2023.12.0, Build 369.pro3
  • Various R libraries
  • RStudio Server running on a RHEL 9-based virtual machine within a VMware vSphere HA cluster. The VM has 50 vCPUs and 196 GB of RAM assigned.
  • Hardware includes Dell PowerEdge R750 servers with dual Xeon Gold 6338N (32-core) CPUs, 512 GB of RAM, and SFP28 25 Gbit networking for all communications.
  • Cluster storage is from an NVMe based Dell SAN.

References

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45: 5–32.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning, 2nd Edition. New York: Springer.